Random Forest variable importance with missing data

Authors

Alexander Hapfelmeier

Kurt Ulm

Torsten Hothorn

Abstract

Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much better able to reflect decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations.
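To make the contrast between complete case analysis and imputation concrete, the following is a minimal sketch in plain Python. It is not the authors' simulation design: the data-generating process, the function names (`simulate`, `make_missing`, `complete_cases`, `mean_impute`) and the use of single mean imputation in place of multiple imputation are all assumptions made for illustration. It only shows how deleting incomplete rows also discards observed values of a fully observed predictor, which is one plausible reason such a predictor could appear penalized.

```python
import random


def simulate(n, seed=0):
    """Hypothetical data: rows of [x1, x2, y], where y depends on both predictors."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        rows.append([x1, x2, x1 + x2 + rng.gauss(0.0, 0.1)])
    return rows


def make_missing(rows, column, n_missing, seed=1):
    """Set n_missing randomly chosen values of one column to None (missing completely at random)."""
    rng = random.Random(seed)
    out = [row[:] for row in rows]
    for i in rng.sample(range(len(out)), n_missing):
        out[i][column] = None
    return out


def complete_cases(rows):
    """Complete case analysis: keep only rows without any missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(rows, column):
    """Single mean imputation (a simplification; the abstract refers to multiple imputation)."""
    observed = [row[column] for row in rows if row[column] is not None]
    mean = sum(observed) / len(observed)
    out = [row[:] for row in rows]
    for row in out:
        if row[column] is None:
            row[column] = mean
    return out


# x1 is fully observed; x2 has 60 of 200 values missing.
data = make_missing(simulate(200), column=1, n_missing=60)
cc = complete_cases(data)
imputed = mean_impute(data, column=1)
x1_observed = sum(row[0] is not None for row in data)

print("observed x1 values:", x1_observed)
print("rows kept by complete case analysis:", len(cc))
print("rows kept after imputation:", len(imputed))
```

In this sketch every value of `x1` is observed, yet complete case analysis uses only the rows in which `x2` is also present, while imputation retains all rows. Any importance measure computed afterwards would therefore see less information about `x1` under complete case analysis.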


Similar resources

Variable selection with Random Forests for missing data

Variable selection has been suggested for Random Forests to improve their efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed in a straightforward way when there is missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a n...


Using Random Forests and Fuzzy Logic for Automated Storm Type Identification

This paper discusses how random forests, ensembles of weakly-correlated decision trees, can be used in concert with fuzzy logic concepts to both classify storm types based on a number of radar-derived storm characteristics and provide a measure of “confidence” in the resulting classifications. The random forest technique provides measures of variable importance and interactions, as well as meth...


Variable Selection from Random Forests: Application to Gene Expression Data

Random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its use for ge...


Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be...


Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...



Publication date: 2012
